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Abstract 

Interval arithmetic is hardly feasible without directed rounding as provided, for 
example, by the IEEE floating-point standard. Equally essential for interval methods 
is directed rounding for conversion between the external decimal and internal binary 
numerals. This is not provided by the standard I/O libraries. Conversion algorithms 
exist that guarantee identity upon conversion followed by its inverse. Although it 
may be possible to adapt these algorithms for use in decimal interval I/O, we argue 
that outward rounding in radix conversion is computationally a simpler problem than 
guaranteeing identity. Hence it is preferable to develop decimal interval I/O ab initio, 
which is what we do in this paper. 

1 Introduction 

Interval arithmetic endows every computation with the authority of proof — the theorem 
being that the real- valued solution x belongs to a set of reals [a, b], where a and b are IEEE- 
standard floating-point numbers. Yet it can easily happen that we get as output something 
obviously wrong such as [0.33333,0.33333] when x = 1/3. It may well be that a 
correct interval has been computed. For example, if [a, b] is the narrowest single-length 
IEEE-standard floating-point interval containing 1/3, and if we do not allow the output to 
be truncated, we get [a, b] as 

[0. 333333313465118408203125, 
0. 3333333432674407958984375], 

which is best written as 

0. 3333333 [13465118408203125, 4 32 674407958984375], 
a notation proposed and analyzed in [9 ]. This is the unabridged version of 
[0 .33333, .33333] , 
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which is what we get with a typical default precision. The untruncated decimal numerals 
are exact representations of [a, b] because every binary floating-point number can be repre- 
sented as a decimal numeral. However, if one wants to ensure exact representation, then we 
get a decimal for every bit; not an economical representation. Then we might as well write 
the bits themselves, which gives 

0.010101010101010101010101 [0,1]. 

On output the only problem is that any reasonable choice of display precision causes 
the standard output routines to shorten the numerals for the bounds to the same result. The 
improvement needed here is output that is aware of whether an upper or a lower bound is to 
be displayed. 

On input, however, we have a more serious problem. Suppose we want to initialize an 
interval variable at 0.1. Initializing the lower and upper bounds with 

lb = ub = . 1 

is guaranteed to generate an interval that does not contain 0.1, for the simple reason that 
there is no floating-point number equal to 0.1. Thus the best the compiler can do is to 
produce one of the floating-point numbers closest to 0.1. We don't know which one it 
is. The table in Figure Q] shows for each of 1/2, 1/3, ... , 1/10 that the compiler picks the 
upper bound of the narrowest interval containing the fraction concerned. To remind us that 
we cannot count on this to happen, the other choice is made for 1/11. Thus, for input we 
need an algorithm that produces, for every fraction or numeral as input, a narrow interval 
containing it; ideally the narrowest. 

Floating-point numbers are normalized as a power of 2 multiplied by a mantissa that is 
between 1 and 2, like this: 

2"(-2) * 1 . 0101010101010101010101 [0, 1] . 

This shows the single-length format, which has 23 bits after the binary point. 

Long strings of bits are usually shown in hexadecimal notation, which uses the 16 char- 
acters 

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f 

to represent four consecutive bits at a time. As there are only 23 bits to be displayed, we 
represent the first three bits in octal notation, with the characters 

0, 1, 2, 3, 4, 5, 6, 7. 

Thus the 23 bits after the binary point are shown as an octal character followed by five 
hexadecimal characters. Thus, the narrowest interval containing 1/3 is 

2~(-2) * 1 . 2aaaa [ a, b] . 

In this way we get the table in Figure Q] 

In this paper we develop algorithms to support software that performs correct input and 
output of intervals. 
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Figure 1: Why standard input cannot be relied on to obtain an interval for some of the 
common fractions. 

2 Previous work 

The problems described in the introduction have motivated Rump [ 8 ] to develop a method 
for decimal input and output for intervals. He computes two 2-dimensional arrays low and 
upp of double-length floating-point numbers that satisfy 

low[d, e] < d x 10 e < upp[d, e] 

for d € {1, 2, ... j 9} and for e S {—340, —339, . . . , 308}. These two arrays contain a total 
of 11682 double-length floating-point numbers. 

With these precomputed numbers available, a decimal numeral 0.did2 • • • dk x 10 e is 
converted on input to the interval 

k k 
[^low[di,e-i],^upp[di,e-i]] (1) 

i=i i=i 

of double-length floating-point numbers, where the inner brackets enclose array subscripts. 
The additions for the lower (upper) bound are performed in downward (upward) rounding 
mode. As a result of the possibly occurring roundings, the interval obtained is not in general 
as narrow as possible. To minimize the inevitable widening, Rump recommends performing 
the additions starting with the smallest array elements. 

The work of Rump should be compared and contrasted with what we will call here 
general-purpose conversion algorithms. Though these are not specifically intended for in- 
terval conversions, they can perhaps be adapted to this purpose because of their guarantees 
on accuracy. 

Let us briefly review these algorithms. Steele and White [5] formulated identity re- 
quirements for conversions between internal and external numerals. The internal identity 
requirement is that conversion of an internal numeral to external and back results in the 



same. If the external numerals are unlimited in length, this requirement can always be met. 
Similarly, the external identity requirement is that conversion of an external numeral to in- 
ternal and back results in the same. Usually, the internal numerals have a fixed length, so 
this requirement can only be met when the external numeral is not longer than is warranted 
by this fixed length. 

Every binary numeral is representable as a decimal one. Hence, if the internal numerals 
are binary and the external ones are decimal, then it is trivial to satisfy the internal identity 
requirement by using the decimal equivalent of the internal binary numeral. The problem 
addressed by Steele and White is to do so with as few decimal digits as possible. 

To transfer binary floating-point numerals from one computer to another by means of 
binary files, one needs to be assured that these files are written and read in a compatible way. 
Moreover, the layout of the floating-point numerals of both machines needs to be the same. 
Adherence to the IEEE floating-point format is not enough: in addition, both machines have 
to be big-endian or both need to be little-endian. Because of these difficulties it is attractive 
to convert internal binary numerals to an external text file containing decimal numerals and 
to have an algorithm that guarantees faithful translation back to internal format. 

Such use of decimal I/O requires that the conversion is efficient. For this reason, Gay 
[3] and Burger and Dybvig [1] devised a faster decimal output. The identity requirement 
assumes sufficiently accurate decimal input. This was addressed by Clinger [2] and by Gay 
0. 

Compared to Rump's approach, this has the advantage of not requiring a database of 
precomputed numbers. 

The identity requirements satisfied by (5] [3l CD are convincing for the purposes en- 
visaged by these authors. However, the requirements for decimal interval I/O are equally 
compelling and quite different: 

1. On input, compute the narrowest floating-point interval containing the decimal nu- 
meral. 

2. On output, compute the narrowest interval containing the floating-point number that 
is bounded by decimal numerals of specified length. 

To satisfy these requirements it seemed simplest to develop our algorithms from first prin- 
ciples rather than to attempt to adapt the work referenced above. 

From [4] it seems that XSC does the conversions between binary and decimal numerals 
correctly, but the authors do not say how it was done. Although the source code of XSC is 
publicly available (GPL license), it contains more than 150 files. Moreover, the code is not 
documented in such a way as to facilitate finding the pieces of code that do the conversion, 
to gather them, and to understand how the conversion was done. 

Our task, then, is to explain how to do the conversion, and to show how to implement 
it. As the utility of this kind of work requires executable programs, we needed to pick an 
implementation language. The primary language for numerical work is Fortran. However, 
this is more in the direction of systems programming, for which C/C++ is a reasonable 
choice of language. 



3 Decimal input 



The "scientific notation" for a number consisting of a decimal fraction and an exponent is 
almost universally used. Thus it would seem that one only has to cater to this format for I/O 
routines. Yet for input there is a strong case to be made for the pre-scientific notation of a 
fraction as a pair of integers. Therefore we consider these two in turn. 

3.1 Rational fractions 

In many situations the most convenient way to input a number is as a rational fraction 
p/q, where p and q are integers. Not only is it convenient, but there is also a convincing 
correctness criterion: as there exists, in a given format, a unique least floating-point interval 
that contains p/q, the input function should yield this interval. 

An algorithm for this purpose is one that has been widely, but not universally, taught 
to children for at least two centuries. We will illustrate this algorithm with p/q = 3/7. As 
this number is less than 1, we know that the binary fraction has the form - . . . . How do 
we get the missing digits? Multiply by two to get 6/7. As the result is less than one, the 
result has the form 0.0... Multiply by two to get 12/7. As the result is not less than one, 
the result has the form 0.01... and subtract 1 from 12/7, so that we continue with 5/7. 
Double again, get 10/7, so that we continue with 3/7 after noting that the result has the 
form 0.011. . . We already know what comes after that, so we have determined that the 
binary equivalent of 3/7 is 0. 011 011 011 ... 

Let us check our computation. The first group of digits after the decimal point is worth 
3/8. Every next one is worth 1/8 times the previous group of three. So the value of the 
infinite string is (3/8) * (1 + (1/8) + (1/8) 2 + ...). Let S = 1 + (1/8) + (1/8) 2 + . . .. 
Then S/8 = (1/8) + (1/8) 2 + . . . = S - 1. Hence S = 8/7 and we find that the infinite 
string is (3/8) * (8/7) = 3/7. 

Let us now translate this procedure to a machine-executable algorithm. We can decide 
whether the next binary digit should be a or a 1 by computing in a variable pwr (from 
"power") the successive powers 2~ fe for k = 0, 1, 2, . . . and adding some of these powers 
in a variable named f rac (from "fraction"). We compute the next value of pwr at every 
step, but only add it to f rac when deciding to write a 1 in the algorithm described above. 
This idea is embodied in the following code. 

float frac = 0.0; 

// fraction to be built up out of powers of two 
float pwr = 1.0; // power of 2 to add to fraction 
// let r = p/q 

while (frac + pwr > frac && p > 0) { 
// p > and r - frac = (p/q) *pwr 
pwr = pwr/ 2.0; p = 2*p; 
if (p >= q) { 

p = p-q; frac = frac+pwr; 

} 

// if p = then r = frac 

// if p > then < r - frac <= pwr 



} 



Typically, there are infinitely many binary digits. Only the first of a finite segment can be 
accommodated in a floating-point number. At every iteration pwr is halved. At some point 
this quantity becomes insignificant. The criterion for this is whether adding pwr to f rac 
makes any difference to f rac. As soon as that is not the case, the iteration terminates. 

The last assertion states that every time around the loop we have that f rac is a lower 
bound for p/q and that the difference between the two is at most 2~\ where i is the number 
of times around the loop. This code builds in f rac a lower bound to p/q that approaches 
p/q as closely as the precision allows. If p/q has can be represented in the floating-point 
number format, this shows by p becoming 0. 

The above segment of code is the kernel of the function shown in Figure |2] for the IEEE 
standard 754 single-length format. It produces the least floating-point interval that contains 
the fraction p/q. 

Interval* convert (int p, int q) { 

// Assumes p and q are positive and less than 2" {30}. 
// Returns the greatest float that is not greater than 
// the rational r = p/q. 

float sf = scaleFactor (p, q) ; 

float frac = 0.0; 

// fraction to be built up out of powers of two 
float pwr = 1.0; // power of 2 to add to fraction 
while (frac + pwr > frac && p > 0) { 

// p > and r - frac = (p/q) *pwr 

pwr = pwr/ 2.0; p = 2*p; 

if (p >= q) { 

p = p-q; frac = frac+pwr; 

} // p > and r - frac = (p/q) *pwr 

} 

frac = frac*sf; 

if (p==0) // r = frac 

return new Interval ( frac, frac) ; 
else return new Interval ( frac, next (frac)); 



Figure 2: A C++ function to convert a rational p/q to a floating-point number. The function 
next ( ) produces the least floating-point number greater than its argument. 

The function assumes that 0.5 < p/q < 1, which is of course not in general the case. It 
therefore needs the function in Figure [3] 

3.2 Input of decimal floating-point numerals 

Many programming languages and data files use a similar format for fractional numbers. 
Though details may vary, it is easy to extract from such files an integer e containing the 



float scaleFactor ( int& p, int& q) { 
// Let r = p/q. 

// Returns a scale factor sf such that p/q is in [0.5,1) 
// and r = (p/q) *sf 

float sf = 1.0; // the scale factor 

// r = (p/q)*sf; assume q < 2~{30} to avoid overflow 

while (q > p) { p = 2*p; sf /= 2.0; } 

// r = (p/q)*sf and q <= 2*p; 

// assume p < 2" {30} to avoid overflow 

while (p >= q) { q = 2*q; sf *= 2.0; } 

// r = (p/q) *sf and p < q <= 2*p; 

// therefore 0.5 <= (p/q) < 1 

return sf; 

} 

Figure 3: A C++ function to scale the fraction p/q. 

exponent and a string containing the decimal digits d\ , . . . , d n of the fraction, where d\ ^ 0. 
We assume that these source code or data file elements are intended to denote the rational 
number r = 10 e Yli=i dil0~ l . The function we aim at here takes as input e and d\,...d n 
and returns the narrowest floating-point interval containing r. 

Such numerals can be treated analogously to the rational fractions discussed in Sec- 
tion [3j] A difference is that instead of subjecting a rational fraction p/q to repeated dou- 
bling possibly combined with subtracting 1, we do this with a string of decimal digits rep- 
resenting the mantissa of the number to be input. 

A more significant difference compared to the rational-fraction case is the presence of 
a power of 10. We need to convert to binary not only the mantissa, but also the input power 
of 10. This happens in a preliminary stage we call binarization. 

For example, suppose we desire to convert 0.0123 to binary. According to our assumed 
convention, we have e = — 1 and have d\ , . . . d n in the form of the string "12 3". 

In this example the power of 10 is exchanged with power of 2 and a corresponding 
change in mantissa as follows: 0.123 * 10" 1 = 0.246 * lO^ 1 * 2" 1 = 0.492 * 10" 1 * 2~ 2 = 
0.984 * 10" 1 * 2~ 3 = 0.1968 * 2" 4 . 

To implement binarization, we need a function to double a decimal mantissa given 
as a string of decimal digits. Each digit is doubled by integer arithmetic operating on 
the numerical equivalent of the digit. In this operation neither rounding nor overflow can 
occur. See the function mul2 in Figure [4] This function is not a general-purpose doubling 
routine: it is specific to its argument being the decimal digits d\, . . . , dn of a mantissa of 
the form Y^=\ di^ % - Accordingly, when the result is 1 or greater, this additional digit is 
not inserted into the resulting string. Instead the last carry is returned. By inspecting it, the 
calling code can determine whether the mantissa has overflowed. 

With the doubling function available, the binarization function is straightforward. See 
Figure [5] 

In the next stage, normalization, we ensure that the mantissa is in the interval [0.5, 1). 
In our current example this happens by means of the steps 0.1968 * 2~ 4 = 0.3936 * 2 -5 = 



int mul2 (strings mnts) { 

// Multiplies decimal mantissa in argument by 2. 
// Returns last carry, 
int carry = 0; 

int dd; // Result of doubling a decimal digit, 
for (int i = mnts . length () -1 ; i >= 0; i — ){ 

dd = 2* (mnts [i] -' 0' ) + carry; 

mnts[i] = dd%10 + '0'; carry = dd/10; 

} 

if (mnts [mnts. length ()-l] == '0') 

mnts. erase (mnts . length ( ) -1 , 1 ) ; 
// mnts is a mantissa, hence no trailing zeros 
return carry; 



Figure 4: A C++ function to double a number of the form Y17=i^i^ % wnere the 
di, . . . ,d n are given in the string argument mnts. 



void binarizeExp ( strings mnts, int& exp) { 
// When called, exp is a decimal exponent. 

// On exit, exp is a binary exponent and mnts is adjusted, 
// so that the same number is denoted. 

int binExp = 0; int carry = 0; 

while (exp < ) { 

carry = mul2 (mnts) ; binExp — ; 

if (carry != 0) { exp+ + ; mnts . insert ( ," 1 ") ; } 

} 

while (exp > ) { 

div2(mnts); binExp++; 

if (mnts[0] == '0') { exp — ; mnts .erase (0,1) ; } 

} 

// exp = 
exp = binExp; 



Figure 5: A C++ function to binarize a number of the form 10 e Y27=i % where tne 
digits di, . . . , d n are given in the string argument mnts. If we denote the values on exit by 
adding primes, we have 10 e £" =1 c^lCr* = 2 e ' d'iW^ 



0.7872 * 2 6 The function for normalizing is in Figure [6l 



void normalize (strings mnts, int& exp) { 
// Maintaining the value of the denoted number, 
// adjusts exp so that 0.5 <= O.mnts < 1. 
while ( (mnts [0] -' 0' ) < 5) { 

exp--; mul2 (mnts) ; // discard zero carry 



Figure 6: A C++ function to normalize a number of the form Y17=l * wnere trie man- 
tissa d\, . . . , d n is given in the string argument mnts. 

After binarization and normalization we are ready to start the conversion of the decimal 
mantissa to binary. A good starting point is one of the radix conversion methods given by 
Knuth Q, section 4.4, where he converts from radix b to radix B. In our case we have 
b = 10 and B = 2. The options are to divide by 10 using radix-2 arithmetic and to multiply 
by B = 2 using radix-10 arithmetic. We select the latter. 

Thus we have a fractional number u given as a string of decimal digits. Knuth obtains 
the digits U\, U2, ■ ■ ■ of the binary representation as follows: 



where {x} denotes x mod 1, which is x — [x\ . 

We must not only obtain the successive binary digits U\, U2, and C/3, but we must also 
pack them as a floating-point number. Hence we modify Knuth's iteration to the following, 
of which we show the first few steps, continuing the previous example. 



0.7872 = 0.7872 * 1 

= 1.5744 * (1/2) 

= 1/2 + 0.5744 * (1/2) 

= 1/2 + 1.1488* (1/4) 

= 1/2 + 1/4 + 0.1488 * (1/4) 

= 1/2 + 1/4 + 0.2976 * (1/8) 

= 1/2 + 1/4 + 0.5952 * (1/16) 

= 1/2 + 1/4 + 1.1904* (1/32) 

= 1/2 + 1/4 + 1/32 + 0.1904 * (1/32), 



[uB\ 

[{uB}B\ 

[{{u}B}B\ 



• • • 1 



Giving 0.11001... as the binary representation of 0.7872. We see that in this way a 
sum of powers of 2 is built up while the remaining decimal mantissa is multiplied by an 
ever smaller factor. The iteration is terminated when this product is less than the machine 
precision. At that point the sum of the powers of 2 is the left bound of the narrowest interval 
containing 0.7872. The multiplications and additions, though floating-point operations, are, 
by their special nature, performed without rounding error. 

In the final stage, we scale the lower bound by the factor 2~ 6 resulting from binarization 
and normalization. We perform this scaling by successive divisions or multiplications by 
2, again assuring the absence of rounding errors. Figure |7] displays a function along these 
lines that finds the narrowest floating-point interval. This function that the fraction x to be 

interval convertFrac (strings mnts, int& exp) { 

interval result; // interval to be returned 
// r == 10~{exp} * O.mnts 

binarizeExp (mnts , exp); 
// r == 2" {exp} * O.mnts 

normalize (mnts, exp) ; 
// r == 2~{exp} * O.mnts and 0.5 <= O.mnts < 1 
// r' == O.mnts and 0.5 <= O.mnts < 1 

float frac = 0.0; float pwr = 1.0; 

int carry = 0; 

while (frac+pwr > frac && mnts . length ( ) > 0) { 
// <= r' - frac = O.mnts * pwr 

pwr /= 2.0; // pwr == 2~{-i} 

carry = mul2(mnts); 

if (carry != 0) { frac += pwr; 

// subtract 1 by discarding nonzero carry 

} 

} // <= r' - frac = O.mnts * pwr 

// and pwr is negligible compared to frac 

// hence frac is the greatest flpt less than r' 

// scale frac with exp 

while (exp > 0) {frac *= 2.0; exp — ;} 
while (exp < 0) {frac /= 2.0; exp++; } 
result. lb = frac; 

result. ub = (mnts . length ( ) == 0) ? frac : next (frac) ; 
return result; 



Figure 7: A C++ function to find the narrowest floating-point interval that contains a given 
number of the form 10 e Y27=i where the di, . . . , d n are given in the string argument 

mnts and e is given in the argument exp. 

converted is non-negative. In that case the result is [a, a'], where a' is the least floating- 
point number greater than floating-point number a. This function can also be used for a 
negative fraction y. If our function gives [a, a'] with input — y, then we change the output 



to [—a', —a]. 

This function uses some auxiliary declarations: one that defines interval and one 
that defines a function to determine the next floating-point number after a given one. The 
auxiliary definitions are displayed in Figure [8j 

typedef struct interval { float lb, ub; } ; 
float next (float x) { 

// returns the least flpt number greater than nonnegative x 
// if x is normalized and if it is less than the greatest 
// flpt number 
// xO = x 

float sf = 1.0; // becomes the scale factor 
// xO = x*sf 

while (x < 1.0) { x *= 2.0; sf /= 2.0; } 
// xO = x*sf and 1 <= x 

while (x >= 2.0) { x /= 2.0; sf *= 2.0; } 
// xO = x*sf and 1 <= x < 2 

float eps = 1.0; // becomes the machine epsilon 
while (((float) 1.0 + eps) > (float)l.O) eps *= 0.5; 
// eps is the first one that did not make a difference 
eps*=2 . 0; 

// eps is the last one that did make a difference 
// by definition the machine epsilon 
return (x+eps) *sf ; 

} 

Figure 8: Some definitions auxiliary to Program [7] 



4 Decimal output 

As in decimal input, the main concern is to avoid rounding errors. This is of course taken 
care of by representing all digits, binary or decimal, separately as small integers. Operations 
on these are free from rounding errors because the operands are integer; they are immune to 
overflow because of their smallness. However, we would also like to speed up conversion 
as much as possible by using operations on floating-point numbers when we can be sure 
that no rounding errors occur or by using operations of full-size integers when we can be 
sure that no overflow can occur. 

Our starting point is the assumption that a floating-point number can be represented by 
an integer for the exponent part and by an integer that is represented by the same sequence 
of bits as there are in the mantissa of the floating-point number. In the case of the IEEE 
standard single-length format, this integer is 24 bits with the most significant bit equal to 1. 
That is, for given floating-point /, we find integers e and m such that / = m*2 e . As before, 
we use the case of the IEEE standard 754 single-length floating-point format as example. 



Consider the assignment m = f , which is legal in C/C++. It implicitly converts the 
floating-point number f to an integer if m is an integer. If f is not an integer, then this 
assignment does not result in m containing the bits of the mantissa of f. If f is an integer, 
then this assignment may also not result in m containing the bits of the mantissa of f : 
f may be larger than the largest integer. However, if we ensure that f is in the interval 
[2 23 , 2 24 ), it is both assured to be an integer, while at the same time not causing overflow 
when assigned to a 32-bit integer. Thus we may extract the bits of f and place them in m 
by simply performing the assignment m = f provided that we first scale f to be in this 
range. The scaling is performed by doubling or halving of f and therefore does not cause 
rounding errors. The required scaling operations are accounted for in the binary exponent 
e. See Figure|9]for a function to perform the conversion from / to the corresponding e and 
m. 

void fltoem (const floats fO, int& e, int& m) { 

// returns m and e such that fO = m * 2~e 

// with 2~{23} <= m < 2"{24} that is 

// m has the same bits as 1 . signif icand of fO 

float f = fO; 

float e23 = 1.0; 

for (int i = 0; i < 23; i++) e23 *= 2.0; // e23 = 2"{23} 
float e24 = 2.0 * e23; 
e = 0; // fO == f * 2~e 

while (f < e23) {f *= 2.0; e — ;} // fO == f * 2~e 
while (f >= e24) {f /= 2.0; e++; } // fO == f * 2~e 
// f is an integer and 2~{23} <= f < 2~{24} 
m = (int)f; // fO == m*2~e and 2"{23} <= m < 2~{24} 



Figure 9: A function to realize the relation fo = m* 2 e , with /o as input and m and e as 
output. Example for single-length floating-point format. 

It remains to determine the string of decimal digits that is equal to m * 2 e with m and 
e given as variables of type integer. If we place no limit on the length of the string, this is 
always possible: decimal strings are closed under doubling and halving. We first convert 
the integer variable m to a decimal string. 

This is done by one of the best known algorithms in programming. One of the forms in 
which it appears is the function itoa in [6]. A streamlined version is the following, though 
it prints the digits in reverse order: 

void print (int n) { 

while (n>0) { cout << n%10; n = n/10; } 
cout << endl; 

} 

What we need is basically the same algorithm, but elaborated by the need to output to a 
string rather than to print. We also need to ensure that the digits are not reversed. Thus we 
arrive at the function in Figure [TOl 



void decimalizeMnts (int& n, strings s) { 

s. erase (0, s . length ()) ; // ensures s is empty 
while (n > 0) { 

s . insert (s .begin () , 1, (char) (' 0' + (n%10))); 
// inserts digit n%10 at the beginning of s 
n = n/10; 

} 

} 

Figure 10: Decimalizes the integer mantissa. 

The result of decimalizeMnts is the representation of the integer m such that / = 
m * 2 e . What we want is a mantissa. That is, we need to move the decimal point from the 
right to the left. This is done by multiplying by a suitable power of 10, which is the length 
of the decimal numeral. Hence the function in Figure [TT] 

int normalize (const strings mnts) { 

// Returns the decimal exponent necessary to turn mnts 
// from a whole number into a normalized mantissa. 
// Assumes mnts has no leading zeros, 
return mnts . length () ; 

} 

Figure 1 1 : Transforms the integer mantissa to a fractional one. 

In the final stage we convert the binary exponent e to increments or decrements of the 
decimal exponent, with a concomitant change in the mantissa. This happens in the function 
shown in Figure [12] 

To summarize, we transform a floating-point number / to an integer exp and a man- 
tissa mnts such that / = O.mnts * \0 ex P by first transforming it to an integer man- 
tissa with a binary exponent (function f ltoem), then decimalizing the mantissa (function 
decimalizeMnts), then transforming the integer mantissa to a fractional one (function 
normalize), and finally folding the binary exponent into the decimal exponent produced 
by the normalization stage. This completes the decimal output of a floating-point number. 
See the function decOut in Figure [T3l 

The function decOut in Figure[[3]gives all of the decimals of a numeral that is equal to 
the floating-point number in question. To obtain the lower bound of the required narrowest 
interval bounded by decimal numerals of specified length, the output of decOut needs to 
be truncated to this length. The upper bound is then obtained by adding to it one unit of the 
last decimal place. 



void decimalizeExp ( strings mnts, int& binExp, int& decExp) { 
// Reduces the binary exponent binExp to zero. 
// Assumes that mnts has no leading zeros, 
while (binExp > 0) { 
binExp — ; 

if (mul2Mnts (mnts) == 1) { 

decExp++; mnts . insert (0, "1" ) ; 

} 

if (mnts [mnts . length () -1] == '0') 
mnts. erase (mnts . length ( ) -1 , 1 ) ; 

} 

while (binExp < 0) { 
div2 (mnts) ; binExp++; 
if (mnts [0] == ' 0' ) { 

mnts .erase (0,1) ; decExp — ; 

} 

} 

} 



Figure 12: Decimalizes the binary exponent. 



void decOut (float f, int& exp, strings mnts) { 
// returns the decimal string equal to f 
int e, m; 

fltoem(f, e, m) ; // f = m * 2" e 
decimalizeMnts (m, mnts); 
// mnts is decimal equivalent of m 
exp = normalize (mnts ) ; 
decimalizeExp (mnts, e, exp); 



Figure 13: Finds the decimal equivalent of floating-point number. 



5 Conclusions 



To input an interval given as a pair of decimal numerals, one needs to assure that the result- 
ing pair of floating-point numbers contains the input pair. The fact that a decimal numeral 
is typically not representable as a binary floating-point number makes it necessary that the 
internal result is wider than the input. We want it to be wider by as little as possible. The 
standard library functions are not adequate for this purpose. In this paper we have developed 
functions that support such interval input. 

For output, there is in principle no problem: for every finite binary floating-point nu- 
meral there is a decimal numeral that is equal to it. Therefore, if one is willing to accept 
lengthy numerals, it is possible to obtain as output an interval that is equal to the internally 
stored floating-point interval. But to obtain an output interval that is both correct and not 
too long, it is in general necessary to round the decimal bounds in the correct directions. 

In this paper we have developed functions that can be used as basic building blocks for 
the required input and output for intervals. These include 

1. For every decimal numeral, computing the unique narrowest floating-point interval 
containing it. 

2. For every finite floating-point number, the decimal numeral that is equal to it. 

In designing these building blocks, we have been guided by the criteria of efficiency, 
language-independence, and machine-independence. 

Efficiency To avoid rounding errors and overflow one can perform all arithmetic on num- 
bers represented by strings of digits. These digits are small integers and are therefore oper- 
ated upon without rounding error and without overflow. The disadvantage is the resulting 
slowness of the algorithms. We have therefore taken advantage of several situations in 
which floating-point operations do not give rounding errors and are much faster. 

Language-independence The most extreme form of language-independence is to write 
the algorithms in pseudo-code. To ensure that our algorithms can be verified, we have 
chosen a simple subset of C++ that almost fits inside the C language. The few excursions 
beyond the bounds of C are amply rewarded, for example, by the availability of the string 
class. Of course, the most important part of verification is the understanding of the code. In 
this respect C is not better than pseudo-code, but not worse either. The reason for presenting 
our algorithms in executable form is that execution is a useful check on verification by 
reading the code. 

We have facilitated reading the code by including as assertions comments. Of each of 
these we claim that it holds whenever execution passes the assertion. 

But the fact that we want executable algorithms, and hence rely on a specific program- 
ming language, does not imply that we feel free to use all the facilities of that language. 
For example, to determine the next floating-point number after a given one, we can use the 
facilities of C to access the bits of a floating-point number. But such facilities vary widely 
among programming languages. We therefore prefer to determine the next floating point 
number by means of arithmetic operations, so that the resulting algorithm is more readily 
transferable to another language. 



Machine-independence For input it is desired to rind the smallest floating-point interval 
containing the number denoted by the input numeral. Usually such a requirement is met by 
modifying the default rounding mode of the processor. We have refrained from using this 
option, as the specification of the programming language does not include these operations. 
As a result, the different implementations of C come with implementation-specific functions 
to perform these operations. 

The motivation of this paper was the usual situation where the semantics of an interval 
is that it contains one or a finite number of solutions. In this situation correctness of the 
conversion process means obtaining an interval that contains the original one. It also hap- 
pens that the semantics of the original interval is that all points in it are solutions, as can 
happen with inequalities. In such a situation the correctness criterion is to transform the 
original interval to one that is contained in it, again differing by as little as possible. The 
functions presented here can serve as building blocks in this situation also. 
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